Shotgun Metagenomic Data Analysis ◾ 305
probability that a scaffold belongs to a bin using an expectation-maximization (EM)
algorithm. The program also provides statistics, including genome completeness, GC-content,
and genome size. Figure 8.1 shows the steps from sequencing to binning.
8.2 SHOTGUN METAGENOMIC ANALYSIS WORKFLOW
The first two steps of the metagenomic data analysis workflow are raw data acquisition and
quality control. After quality control, the data can follow two different routes:
(i) de novo assembly and subsequent analysis and (ii) assembly-free analysis. In the
following, we discuss these steps with a worked example.
8.2.1 Data Acquisition
Raw shotgun metagenomic data consists of sequences of metagenomic DNA extracted from
either environmental or clinical samples, which usually contain several species of microbes.
Depending on the sequencing technology, the data can be short reads produced by Illumina
and other short-read sequencing technologies or long reads produced by either Pacific
Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). The read layout can be
single end or paired end. The raw data is usually provided in FASTQ files. Many researchers
upload their raw data to a database such as the NCBI SRA and make it publicly available.
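As a quick illustration of the FASTQ layout, the sketch below counts the reads in a file and checks that the two files of a paired-end run hold the same number of reads. It assumes uncompressed FASTQ files with exactly four lines per record (header, sequence, separator, qualities); the function names “count_reads” and “paired_counts_match” are hypothetical helpers, not part of any toolkit.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes exactly four lines per record (no wrapped sequence lines).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: succeed only if both mate files of a paired-end run
# contain the same number of reads.
paired_counts_match() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}
```

For a paired-end run, “paired_counts_match sample_1.fastq sample_2.fastq” (file names here are placeholders) should succeed; a mismatch usually points to a truncated or incomplete file.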
We will download FASTQ files from the NCBI SRA database to demonstrate how the
analysis is conducted. The run accessions are “ERR1823587”, “ERR1823601”, and
“ERR1823608”, which contain shotgun metagenomic data of human stool samples from a
healthy, a moderate, and a severe sickle cell disease patient, respectively. We will create the
directory “shotgun” as the project working directory; then, we will use the SRA Toolkit
“fasterq-dump” utility to download the paired-end files into a directory called “fastqdir”.
mkdir shotgun; cd shotgun
fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823587
fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823601
fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823608
Six FASTQ files will be saved in “fastqdir”. Use “ls fastqdir/” to list the contents of
that directory and make sure the files are there.
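The download and the check can also be scripted. The following is a minimal sketch, not the only way to do it: “download_runs” wraps the same “fasterq-dump” call in a loop, and “check_fastq_pairs” reports any expected file that is missing or empty. Both function names are hypothetical, and the check assumes that fasterq-dump writes paired-end reads with its default split naming, that is, “<run>_1.fastq” and “<run>_2.fastq”.

```shell
# Hypothetical helper: download each run accession into the given directory,
# using the same fasterq-dump options as the commands above.
download_runs() {
    outdir=$1; shift
    for run in "$@"; do
        fasterq-dump --threads 4 --verbose --outdir "$outdir" "$run" || return 1
    done
}

# Hypothetical helper: verify that both mate files exist and are non-empty
# for every run. Assumes the <run>_1.fastq / <run>_2.fastq naming.
check_fastq_pairs() {
    dir=$1; shift
    missing=0
    for run in "$@"; do
        for mate in 1 2; do
            f="$dir/${run}_${mate}.fastq"
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f" >&2
                missing=$((missing + 1))
            fi
        done
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

For the worked example, “download_runs fastqdir ERR1823587 ERR1823601 ERR1823608” followed by “check_fastq_pairs fastqdir ERR1823587 ERR1823601 ERR1823608” would fetch the three runs and confirm that all six files are present.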
8.2.2 Quality Assessment and Processing
If you obtained these files directly from the sequencer, they will likely need quality
control, which includes both quality assessment using a quality assessment program
such as FastQC and processing to filter out the low-quality reads, to trim low-quality
[Figure panels: Sequencing, Reads, Assembly, Contigs, Binning (Bin 1, Bin 2, Bin 3, Bin 4)]
FIGURE 8.1 Reads processing from sequencing to bin formation.